Initially created at 13.03.2018 by Petteri Nevavuori (petteri.nevavuori@mtech.fi)
In this notebook we'll train several CNNs with the drone datasets and investigate whether the drone images alone are sufficient inputs for predicting yields. We will train with the datasets generated in the previous notebook, using the images as inputs and area-wise yield means as training targets.
The CNN will effectively comprise a multi-layer convolutional stack connected to several linear layers for yield prediction. Several research questions guide the comparisons below.
The first comparison point is the optimizer. While some hints were already provided in the CNN building phase, we'll assess the differences more distinctly here. We will compare vanilla implementations of PyTorch's SGD with momentum, RMSprop and Adadelta. The CNN uses SGD with momentum by default, so only RMSprop and Adadelta need to be passed explicitly. We won't use early stopping yet, as we want to see how the training progresses.
While training the models with varying optimizers, we noticed that too large a batch size would sometimes cause the optimizer to fail to reduce the objective loss. We will therefore explore the optimizer limits. The initial intuition is that batch size faces a dual limit: the first is GPU memory, and the second is a level above which the optimizer switches from functional to detrimental.
We will test every batch size with three initializations to see whether the random initialization of the model's parameters plays a noticeable role.
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import numpy as np
from torch import optim
from field_analysis.model.dataset.dataperiod import DroneNDVIEarlier, DroneRGBEarlier
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
db_32 = 'field_analysis_10m_32px.db'
db_64 = 'field_analysis_20m_64px.db'
db_128 = 'field_analysis_40m_128px.db'
dbs = [db_32, db_64, db_128]
def test_optimizer_batch_size(optimizer):
    plt.rcParams['figure.figsize'] = 10, 3
    batch_sizes = [32*2**x for x in range(6)]
    for i, source_dim in enumerate([32, 64, 128]):
        for j, dataset in enumerate([DroneNDVIEarlier, DroneRGBEarlier]):
            ds_name = "NDVI"
            if j == 1:
                ds_name = "RGB"
            for batch_size in batch_sizes:
                losses = []
                try:
                    for k in range(3):
                        train, test = dataset(dbs[i]).separate_train_test(
                            batch_size=batch_size,
                            train_ratio=0.8)
                        cnn = DroneYieldMeanCNN(
                            source_bands=max(1, 3*j),
                            source_dim=source_dim,
                            optimizer=optimizer)
                        losses_dict = cnn.train(
                            epochs=3,
                            training_data=train,
                            test_data=test,
                            visualize=False,
                            suppress_output=True,
                            save_model=False)
                        losses.append(
                            np.array(losses_dict['test_losses_mean_std'])[:, 0].min())
                except Exception:
                    # Batch sizes exceeding GPU memory raise here; skip them.
                    pass
                if len(losses) > 0:
                    losses = np.array(losses)
                    plt.scatter([batch_size]*len(losses), losses, alpha=0.5)
                    plt.errorbar(batch_size, losses.mean(),
                                 losses.std(), capsize=6, marker='o')
            plt.title('Best Test Losses for {} {}x{}'.format(ds_name, source_dim, source_dim))
            plt.xlabel('Batch Size')
            plt.ylabel(r'$\mu_{Loss}$')
            plt.xticks(batch_sizes)
            plt.ylim(bottom=0)
            plt.xlim(16, 1040)
            plt.grid()
            plt.tight_layout()
            plt.show()
test_optimizer_batch_size(optimizer=None)
test_optimizer_batch_size(optimizer=optim.RMSprop)
test_optimizer_batch_size(optimizer=optim.Adadelta)
Here are the results from trying out multiple batch sizes in the range $[2^5, 2^{10}]$. The results are given for each optimizer and dataset in the tables below. The columns represent the batch sizes and the rows the batch-wise feasibilities for each optimizer-dataset pair. The possible outcomes are feasible (Y), not feasible (N) and over the GPU memory limit (-).
The feasibility is determined by whether the optimizer was able to start minimizing the test error within three epochs. A telltale sign of failure to minimize is a test loss that stays around the median of the target values: the network then produces values close to zero while the absolute targets are around 6500. As each dataset-optimizer pair is initialized and trained three times, a pair is deemed feasible when the majority of initializations achieve sufficient minimization. With three initializations this means that one non-minimizing initialization is within the threshold.
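The majority-vote feasibility rule described above can be sketched as follows. The helper and its threshold are hypothetical illustrations, not part of the training code: a run counts as non-minimizing when its best test loss stays near the target median (~6500), so any threshold well below that level separates the two cases.

```python
# Hypothetical helper illustrating the feasibility rule: a dataset-optimizer
# pair is feasible when the majority of initializations manage to minimize.
# The threshold of 6000 is an illustrative assumption below the ~6500 median.
def is_feasible(best_losses, non_minimizing_threshold=6000):
    """best_losses: best test loss per initialization (one value per run)."""
    minimizing_runs = sum(1 for loss in best_losses if loss < non_minimizing_threshold)
    return minimizing_runs > len(best_losses) // 2  # majority must minimize

# One of three initializations failing to minimize is still within the threshold.
print(is_feasible([450.0, 6400.0, 520.0]))   # True
print(is_feasible([6450.0, 6400.0, 520.0]))  # False
```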
| SGD | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| NDVI 32 | Y | Y | Y | Y | N | N |
| RGB 32 | Y | N | Y | Y | N | N |
| NDVI 64 | Y | Y | Y | Y | N | N |
| RGB 64 | Y | Y | Y | Y | N | N |
| NDVI 128 | Y | Y | Y | Y | N | - |
| RGB 128 | Y | Y | Y | Y | N | - |
| RMSprop | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| NDVI 32 | Y | Y | Y | Y | Y | Y |
| RGB 32 | N | N | Y | Y | Y | Y |
| NDVI 64 | Y | Y | Y | N | Y | N |
| RGB 64 | Y | Y | Y | Y | N | Y |
| NDVI 128 | Y | N | Y | Y | Y | - |
| RGB 128 | N | Y | Y | Y | Y | - |
| Adadelta | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| NDVI 32 | Y | Y | Y | Y | Y | Y |
| RGB 32 | Y | Y | Y | Y | Y | Y |
| NDVI 64 | Y | Y | Y | Y | Y | Y |
| RGB 64 | Y | Y | Y | Y | Y | Y |
| NDVI 128 | Y | Y | Y | Y | Y | - |
| RGB 128 | Y | Y | Y | Y | Y | - |
Adadelta seems to be the most robust optimizer, while RMSprop is the pickiest. RMSprop's behavior also induces mistrust, as it appears quite unreliable: there is no clear pattern in the settings that cause its optimization to fail, and it fails from the smallest to the largest batch sizes. SGD fails at times as well, but clearly only at the higher batch sizes. The comparison will thus continue with SGD and Adadelta only, using a fixed batch size of 128, because they are reliable and do not seemingly at random fail to start minimizing.
We'll begin by looking at the NDVI datasets first. Initially we'll use a slightly deeper topology, as it proved to have enough capacity to show a distinction between training and test losses. This is desirable because the model is able to fit better, and it makes regularization viable for driving the test losses down. Each model is trained for 50 epochs.
Then we'll do the same comparison with RGB images. We will train each dataset with each of the optimizers and see how they manage.
import os
import pandas as pd
import numpy as np
from torch import optim
from field_analysis.model.dataset import dataperiod as dp
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
import field_analysis.settings.model as model_settings
%matplotlib inline
db_32 = 'field_analysis_10m_32px.db'
db_64 = 'field_analysis_20m_64px.db'
db_128 = 'field_analysis_40m_128px.db'
dbs = [db_32, db_64, db_128]
optimizer_models_dir = os.path.join(model_settings.MODELS_DIR,'optimizer')
os.makedirs(optimizer_models_dir,exist_ok=True)
optimizers = [None, optim.Adadelta]
def test_optimizer(dataloader):
    losses = pd.DataFrame()
    for i, db in enumerate(dbs):
        dataset = dataloader(db_name=db)
        dataset_name = dataset.__class__.__name__
        source_bands = 1  # NDVI
        if 'RGB' in dataset_name:
            source_bands = 3
        for optimizer in optimizers:
            source_dim = 32*(2**i)
            if optimizer is not None:
                optim_name = 'Adadelta'
            else:
                optim_name = 'SGD'
            cnn = DroneYieldMeanCNN(
                source_bands=source_bands,
                source_dim=source_dim,
                cnn_layers=6,
                fc_layers=2,
                optimizer=optimizer)
            cnn.model_path = os.path.join(optimizer_models_dir, cnn.model_filename)
            print(cnn.model_path)
            losses_dict = cnn.train(
                epochs=50,
                training_data=dataset,
                k_cv_folds=3,
                suppress_output=True)
            best_loss = np.array(losses_dict['test_losses_mean_std'])[:, 0].min()
            losses.loc[source_dim, optim_name] = best_loss
    return losses
result_earlier_ndvi = test_optimizer(dataloader=dp.DroneNDVIEarlier)
result_later_ndvi = test_optimizer(dataloader=dp.DroneNDVILater)
result_earlier_rgb = test_optimizer(dataloader=dp.DroneRGBEarlier)
result_later_rgb = test_optimizer(dataloader=dp.DroneRGBLater)
First we'll take a look at the test losses produced with distinct datasets for each optimizer.
pd.options.display.float_format = '{:.2f}'.format
The following tables show the best test L1-losses with distinct datasets and optimizers. The first table is for the earlier dataset with pre-July Drone NDVI images:
result_earlier_ndvi
result_later_ndvi
Let's pull up the tables for period-wise lowest L1-losses with only 50 epochs and no tuning. First one is the table for pre-July RGB datasets:
result_earlier_rgb
result_later_rgb
After ruling out RMSprop already in the batch-size exploration stage, the comparison was conducted between SGD with momentum and Adadelta. Adadelta produced the best results in every training configuration, meaning it succeeded better in utilizing the capacity of the model. We will thus use Adadelta as the optimizer.
Next up is comparing several depths for the CNN component of the network. We will keep the FC layers at two to really isolate the CNN performance. A good result is achieved when the network is able even to overfit: that means the capacity is sufficient and allows regularization to be used to drive the test error down.
Even though the total number of trainings is high (48 distinct trainings), we will still go through them all. In the later optimization stages we will use only some of the datasets if they produce results similar to the optimizer comparison. We will also increase the number of epochs to see how the deeper models progress.
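The capacity argument above hinges on the generalization gap, the difference between the best test loss and the best training loss: a clearly positive gap shows the model had enough capacity to (over)fit. A minimal sketch with made-up loss curves:

```python
import numpy as np

# Hypothetical sketch: the generalization gap is read from the loss curves as
# the difference between the best (lowest) test loss and best training loss.
def generalization_gap(train_losses, test_losses):
    return float(np.min(test_losses) - np.min(train_losses))

train = [900.0, 600.0, 410.0, 350.0]  # training loss per epoch (made-up values)
test = [950.0, 700.0, 560.0, 530.0]   # test loss per epoch (made-up values)
print(generalization_gap(train, test))  # 180.0
```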
import os
import pandas as pd
import numpy as np
from torch import optim
from field_analysis.model.dataset import dataperiod as dp
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
import field_analysis.settings.model as model_settings
%matplotlib inline
db_32 = 'field_analysis_10m_32px.db'
db_64 = 'field_analysis_20m_64px.db'
db_128 = 'field_analysis_40m_128px.db'
dbs = [db_32, db_64, db_128]
depth_models_dir = os.path.join(model_settings.MODELS_DIR,'depth')
results_dir = os.path.join(os.getcwd(),'results')
os.makedirs(depth_models_dir,exist_ok=True)
os.makedirs(results_dir,exist_ok=True)
def test_depth(dataloader, bands):
    depths = list(range(4, 14, 2))
    multi_index = pd.MultiIndex.from_product([[32, 64, 128], depths])
    losses = pd.DataFrame(index=['test', 'train'], columns=multi_index)
    for i, db in enumerate(dbs):
        dataset = dataloader(db_name=db)
        for depth in depths:
            source_dim = 32*(2**i)
            cnn = DroneYieldMeanCNN(
                source_bands=bands,
                source_dim=source_dim,
                cnn_layers=depth,
                fc_layers=2,
                optimizer=optim.Adadelta)
            cnn.model_path = os.path.join(depth_models_dir, cnn.model_filename)
            print(cnn.model_path)
            losses_dict = cnn.train(
                epochs=50,
                training_data=dataset,
                k_cv_folds=3,
                suppress_output=True)
            best_test_loss = np.array(losses_dict['test_losses_mean_std'])[:, 0].min()
            best_train_loss = np.array(losses_dict['training_losses_mean_std'])[:, 0].min()
            losses.loc['test', (source_dim, depth)] = best_test_loss
            losses.loc['train', (source_dim, depth)] = best_train_loss
    return losses
def test_depth_single(dataloader, bands, db, depth, dim):
    depths = list(range(4, 14, 2))
    multi_index = pd.MultiIndex.from_product([[32, 64, 128], depths])
    losses = pd.DataFrame(index=['test', 'train'], columns=multi_index)
    dataset = dataloader(db_name=db)
    source_dim = dim
    cnn = DroneYieldMeanCNN(
        source_bands=bands,
        source_dim=source_dim,
        cnn_layers=depth,
        fc_layers=2,
        optimizer=optim.Adadelta)
    cnn.model_path = os.path.join(depth_models_dir, cnn.model_filename)
    print(cnn.model_path)
    losses_dict = cnn.train(
        epochs=50,
        training_data=dataset,
        k_cv_folds=3,
        suppress_output=True)
    best_test_loss = np.array(losses_dict['test_losses_mean_std'])[:, 0].min()
    best_train_loss = np.array(losses_dict['training_losses_mean_std'])[:, 0].min()
    losses.loc['test', (source_dim, depth)] = best_test_loss
    losses.loc['train', (source_dim, depth)] = best_train_loss
    return losses
First, as with the optimizer, we'll go through the NDVI datasets. Then the RGB ones.
depth_ndvi_earlier = test_depth(dp.DroneNDVIEarlier, 1)
depth_ndvi_earlier.to_csv(os.path.join(results_dir,'depth_ndvi_earlier.csv'))
print("NDVI Earlier")
depth_ndvi_earlier
depth_ndvi_later = test_depth(dp.DroneNDVILater, 1)
depth_ndvi_later.to_csv(os.path.join(results_dir,'depth_ndvi_later.csv'))
print("NDVI Later")
depth_ndvi_later
# depth_ndvi_later_single = test_depth_single(dp.DroneNDVILater, 1, db_64, 10, 64)
# depth_ndvi_later.loc[:, (64, 10)] = depth_ndvi_later_single.loc[:, (64, 10)]
# depth_ndvi_later.to_csv(os.path.join(results_dir, 'depth_ndvi_later.csv'))
depth_rgb_earlier = test_depth(dp.DroneRGBEarlier, 3)
depth_rgb_earlier.to_csv(os.path.join(results_dir,'depth_rgb_earlier.csv'))
print("RGB Earlier")
depth_rgb_earlier
depth_rgb_later = test_depth(dp.DroneRGBLater, 3)
depth_rgb_later.to_csv(os.path.join(results_dir,'depth_rgb_later.csv'))
print("RGB Later")
depth_rgb_later
depth_rgb_later_single = test_depth_single(dp.DroneRGBLater, 3, db_32, 8, 32)
# Use integer keys to match the in-memory DataFrame's integer MultiIndex.
depth_rgb_later.loc[:, (32, 8)] = depth_rgb_later_single.loc[:, (32, 8)]
depth_rgb_later.to_csv(os.path.join(results_dir, 'depth_rgb_later.csv'))
With so many numbers it is getting hard to grasp the progression, so we plot them out instead. The error values of unfitted models are handled as NaNs to ensure proper scaling of the plots. The plotted areas use the training error as the lower bound and the test error as the upper bound.
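The NaN handling described above can be sketched like this. The threshold is an illustrative assumption: runs where the optimizer never started minimizing produce losses near the ~6500 target median, which would otherwise dominate the y-axis, and matplotlib simply skips NaN values when plotting.

```python
import numpy as np

# Made-up best losses; the third run never started minimizing.
losses = np.array([420.0, 510.0, 6480.0, 390.0])

# Mask non-minimizing runs as NaN so the plots stay scaled to fitted models.
# The threshold of 6000 is an illustrative assumption.
masked = np.where(losses > 6000, np.nan, losses)
print(masked)  # third value becomes NaN
```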
import pandas as pd
import numpy as np
import os
depth_ndvi_earlier=pd.read_csv(os.path.join(results_dir,'depth_ndvi_earlier.csv'),index_col=0,header=[0,1])
depth_ndvi_later=pd.read_csv(os.path.join(results_dir,'depth_ndvi_later.csv'),index_col=0,header=[0,1])
depth_rgb_earlier=pd.read_csv(os.path.join(results_dir,'depth_rgb_earlier.csv'),index_col=0,header=[0,1])
depth_rgb_later=pd.read_csv(os.path.join(results_dir,'depth_rgb_later.csv'),index_col=0,header=[0,1])
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
plt.rcParams['figure.figsize'] = 12, 16
hatches = ['/', None, '|']
x = list(range(4, 14, 2))
result_sets = [depth_ndvi_earlier, depth_ndvi_later, depth_rgb_earlier, depth_rgb_later]
result_set_names = ['NDVI Earlier', 'NDVI Later', 'RGB Earlier', 'RGB Later']
for i, label in enumerate(['10m', '20m', '40m']):
    window_px = str(32*2**i)
    for j, result_set in enumerate(result_sets):
        plt.subplot(411+j)
        plt.fill_between(x,
                         list(result_set.loc['train', window_px].values),
                         list(result_set.loc['test', window_px].values),
                         label=label,
                         hatch=hatches[i],
                         edgecolor='gray',
                         alpha=0.4)
        plt.xticks(x)
        plt.xlim([4, 12])
        plt.legend()
        plt.grid()
        plt.xlabel("Depth")
        plt.ylabel("Mean Absolute Error")
        plt.title(f"Generalization Gaps for {result_set_names[j]}")
plt.tight_layout()
plt.savefig(os.path.join(os.getcwd(), 'results', 'cnn-depth.png'),
            dpi=300, bbox_inches='tight', pad_inches=0.1)
plt.show()
The optimal result is achieved with a depth of 6 layers and the 128px/40m image dataset. The test losses are, however, notably lower for the RGB images than for the NDVI images.
We have already concluded that the optimal optimizer is Adadelta and the optimal CNN depth is 6 layers. As a by-product of the depth search we have also concluded that the lowest test loss is best achieved using the 128px RGB datasets. Now it is time to attempt to drive the test loss down by means of regularization. We have two options: early stopping and weight decay.
The hyperparameter for early stopping is the number of consecutive non-improving training epochs to allow before terminating the training. We will also implement a second, boolean hyperparameter for whether to continue training after the first termination. The hyperparameter for weight decay is the decay coefficient. We will try out several values with random search: a fixed number of trainings, keeping the one that produces the lowest test error.
We will set the hyperparameter value ranges as follows:
- Early stopping patience (`patience`): $[10, \dots, 50]$
- Weight decay coefficient (`weight_decay`): $[0.0, \dots, 1.0]$

While this is just a comparison, we'll continue using a limited number of epochs to see how the overall training progresses. We'll also proceed by first testing the weight decay and only after that the early stopping, because training times grow excessively once the number of epochs is in the ballpark where early stopping is able to show its effectiveness.
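Drawing random-search candidates from these ranges can be sketched as follows; the sample count and seed are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)  # illustrative seed for reproducibility

# Patience is an integer in [10, 50]; randint's upper bound is exclusive.
patiences = np.random.randint(10, 51, size=5)
# Weight decay is a float drawn uniformly from [0.0, 1.0].
weight_decays = np.random.uniform(0.0, 1.0, size=5)
print(patiences, weight_decays)
```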
import os
import shutil
import numpy as np
import pandas as pd
import torch
from torch import optim
from field_analysis.model.dataset import dataperiod as dp
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
import field_analysis.settings.model as model_settings
%matplotlib inline
DB_128 = 'field_analysis_40m_128px.db'
DATASET_NAMES = ['earlier', 'later']
EPOCHS = 50
regularized_models_dir = os.path.join(model_settings.MODELS_DIR,'regularization')
os.makedirs(regularized_models_dir,exist_ok=True)
def copy_model(cnn, is_later, save):
    "Copy the dataset-wise persisted model either for later use (`save=True`) or current use (`save=False`)."
    cnn.model_path = os.path.join(regularized_models_dir, cnn.model_filename)
    model_folder, _ = os.path.split(cnn.model_path)
    model_name, suffix = cnn.model_filename.split('.')
    model_name = "initial_model_{}.{}".format(
        DATASET_NAMES[is_later], suffix)
    if save:
        cnn.save_model()
        from_path = cnn.model_path
        to_path = os.path.join(model_folder, model_name)
    else:
        from_path = os.path.join(model_folder, model_name)
        to_path = cnn.model_path
    shutil.copyfile(from_path, to_path)
    print("Persisted model copied \n\tFrom: {} \n\tTo: {}".format(from_path, to_path))
Before we delve deeper into comparing the performance metrics with varying hyperparameter values, we will initialize a network with no further training. This ensures that all runs are performed with an equally initialized model.
First we initialize and persist the initial model for the earlier dataset.
cnn = DroneYieldMeanCNN(
    source_bands=3,
    source_dim=128,
    cnn_layers=6,
    optimizer=optim.Adadelta)
copy_model(cnn=cnn, is_later=False, save=True)
Then we initialize and persist the later dataset's initial model.
cnn = DroneYieldMeanCNN(
    source_bands=3,
    source_dim=128,
    cnn_layers=6,
    optimizer=optim.Adadelta)
copy_model(cnn=cnn, is_later=True, save=True)
We will first perform benchmark trainings with no regularization. This is to see where the training would progress. We will then compare the regularized trainings to these to see the level of improvement attained.
cnn = DroneYieldMeanCNN(
    source_bands=3,
    source_dim=128,
    cnn_layers=6,
    optimizer=optim.Adadelta)
copy_model(cnn=cnn, is_later=False, save=False)
cnn.load_model()
_ = cnn.train(
    epochs=EPOCHS,
    training_data=dp.DroneRGBEarlier(DB_128),
    k_cv_folds=3)
cnn = DroneYieldMeanCNN(
    source_bands=3,
    source_dim=128,
    cnn_layers=6,
    optimizer=optim.Adadelta)
copy_model(cnn=cnn, is_later=True, save=False)
cnn.load_model()
_ = cnn.train(
    epochs=EPOCHS,
    training_data=dp.DroneRGBLater(DB_128),
    k_cv_folds=3)
First we research the optimal weight decay by running a series of grid searches. We want to see whether a coarse region of better test errors can be found with a maximum of only 50 epochs. We'll then use this information to perform random searches in the neighborhood of the best coarse values, drawing samples from a normal distribution whose mean is the dataset-wise best grid-search value and whose standard deviation focuses the samples around that mean.
def test_weight_decay(dataset, weight_decays):
    best_losses = pd.DataFrame(
        columns=['weight_decay', 'best_loss', 'loss_mean', 'loss_std'])
    for weight_decay in weight_decays:
        print("weight_decay={}".format(weight_decay))
        cnn = DroneYieldMeanCNN(
            source_bands=3,
            source_dim=128,
            cnn_layers=6,
            optimizer=optim.Adadelta,
            optimizer_parameters={'weight_decay': weight_decay})
        # `dataset` is the class itself, so compare identities instead of isinstance().
        copy_model(cnn=cnn, is_later=dataset is dp.DroneRGBLater, save=False)
        cnn.load_model()
        losses_dict = cnn.train(
            epochs=EPOCHS,
            training_data=dataset(DB_128),
            k_cv_folds=3,
            suppress_output=True)
        losses = np.array(losses_dict['test_losses_mean_std'])[:, 0]
        best_losses = best_losses.append(
            {'weight_decay': weight_decay,
             'best_loss': losses.min(),
             'loss_mean': losses.mean(),
             'loss_std': losses.std()},
            ignore_index=True)
    return best_losses.sort_values(by='weight_decay').reset_index(drop=True)
#(weight_decay, best_loss)
best_l2_earlier = (None,None)
best_l2_later = (None,None)
So first we perform a crude grid search over several powers of ten.
l2_earlier = test_weight_decay(dp.DroneRGBEarlier, [1e-3, 1e-2, 1e-1, 1e0, 1e1])
l2_earlier.plot(x='weight_decay', y='best_loss', logx=True, grid=True, title='RGB Earlier $L^2$ Grid Search')
l2_earlier
best_row = l2_earlier.loc[l2_earlier['best_loss'].idxmin()]
best_l2_earlier = (best_row['weight_decay'], best_row['best_loss'])
best_l2_earlier
l2_later = test_weight_decay(dp.DroneRGBLater, [1e-3, 1e-2, 1e-1, 1e0, 1e1])
l2_later.plot(x='weight_decay', y='best_loss', logx=True, grid=True, title='RGB Later $L^2$ Grid Search')
l2_later
best_row = l2_later.loc[l2_later['best_loss'].idxmin()]
best_l2_later = (best_row['weight_decay'], best_row['best_loss'])
best_l2_later
Then we perform a zoomed search around the optimal values.
There is one discussion point, however: the lowest test loss produced for the RGB Earlier dataset. Comparing the loss-progression graphs, the lowest loss with weight_decay=0.001 could be attributable to mere random fluctuation. Supporting this notion, the graphs for the other regularization values produce values close to the lowest attained loss more coherently, while the loss of the first regularization test with the first dataset jumps back up to the ~550 test-loss range.
Thus it seems that for both datasets we could use a similar ballpark of random-search values. One option would be to use a normal distribution; another is a uniform distribution over a limited range. We will go with the normal distribution to properly zoom into a range of optimal values.
import matplotlib.pyplot as plt
import numpy as np
_ = plt.hist(np.random.normal(loc=1e-1, scale=3*1e-2, size=1000), bins=100)
l2_earlier = test_weight_decay(
    dp.DroneRGBEarlier,
    np.abs(np.random.normal(loc=best_l2_earlier[0],
                            scale=5*best_l2_earlier[0]*0.1,
                            size=10)))
l2_earlier.plot(x='weight_decay', y='best_loss', logx=True, grid=True)
l2_earlier
best_row = l2_earlier.loc[l2_earlier['best_loss'].idxmin()]
if best_row['best_loss'] < best_l2_earlier[-1]:
best_l2_earlier = (best_row['weight_decay'], best_row['best_loss'])
best_l2_earlier
l2_later = test_weight_decay(
    dp.DroneRGBLater,
    np.abs(np.random.normal(loc=best_l2_later[0],
                            scale=5*best_l2_later[0]*0.1,
                            size=10)))
l2_later.plot(x='weight_decay', y='best_loss', logx=True, grid=True)
l2_later
best_row = l2_later.loc[l2_later['best_loss'].idxmin()]
if best_row['best_loss'] < best_l2_later[-1]:
best_l2_later = (best_row['weight_decay'], best_row['best_loss'])
best_l2_later
best_l2_earlier = (0.013287424820534941, 402.9876080024533)
best_l2_later = (0.001, 388.20532696474464)
print("RGB Earlier")
print("\tWeight Decay: {}".format(best_l2_earlier[0]))
print("\tBest Loss: {}".format(best_l2_earlier[1]))
print("RGB Later")
print("\tWeight Decay: {}".format(best_l2_later[0]))
print("\tBest Loss: {}".format(best_l2_later[1]))
Next we test out multiple settings of early stopping. With early stopping, Goodfellow et al. (2016) advise performing sequential training after the training has been terminated early, using the same termination setting. We will thus try out several values for the early-stopping patience, which determines how many non-improving epochs we allow the training to pass before terminating it. We try patiences of 10, 20, 30, 40 and 50.
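The patience rule itself can be sketched independently of the training loop. This is a hypothetical helper, not part of the `DroneYieldMeanCNN` API, and the loss values are made up: training stops once `patience` consecutive epochs pass without improving the best test loss.

```python
# Minimal sketch of early-stopping patience: return the epoch at which
# training terminates, or the total epoch count if it runs to completion.
def epochs_until_stop(test_losses, patience):
    best = float('inf')
    waited = 0
    for epoch, loss in enumerate(test_losses, start=1):
        if loss < best:
            best = loss
            waited = 0  # improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch  # terminated early at this epoch
    return len(test_losses)

losses = [500.0, 450.0, 460.0, 455.0, 470.0, 440.0, 452.0, 453.0, 451.0]
print(epochs_until_stop(losses, patience=3))  # 5
```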
def test_early_stopping(dataset, weight_decay, patiences):
    best_losses = pd.DataFrame(
        columns=['patience', 'best_loss', 'epochs', 'loss_mean', 'loss_std'])
    for patience in patiences:
        print("patience={}".format(patience))
        cnn = DroneYieldMeanCNN(
            source_bands=3,
            source_dim=128,
            cnn_layers=6,
            optimizer=optim.Adadelta,
            optimizer_parameters={'weight_decay': weight_decay})
        # `dataset` is the class itself, so compare identities instead of isinstance().
        copy_model(cnn=cnn, is_later=dataset is dp.DroneRGBLater, save=False)
        cnn.load_model()
        losses_dict = cnn.train(
            epochs=250,
            training_data=dataset(DB_128),
            k_cv_folds=3,
            early_stopping_patience=patience)
        losses = list(np.array(losses_dict['test_losses_mean_std'])[:, 0])
        # Sequential training: continue once more after the first early termination.
        losses_dict = cnn.train(
            epochs=250,
            training_data=dataset(DB_128),
            k_cv_folds=3,
            early_stopping_patience=patience)
        losses += list(np.array(losses_dict['test_losses_mean_std'])[:, 0])
        losses = np.array(losses).flatten()
        best_losses = best_losses.append(
            {'patience': patience,
             'epochs': losses.size,
             'best_loss': losses.min(),
             'loss_mean': losses.mean(),
             'loss_std': losses.std()},
            ignore_index=True)
    return best_losses.sort_values(by='patience').reset_index(drop=True)
patience_earlier = test_early_stopping(
    dataset=dp.DroneRGBEarlier,
    weight_decay=best_l2_earlier[0],
    patiences=[10, 20, 30, 40, 50])
patience_later = test_early_stopping(
    dataset=dp.DroneRGBLater,
    weight_decay=best_l2_later[0],
    patiences=[10, 20, 30, 40, 50])
The results of early stopping are given for both datasets in the following tables:
print("RGB Earlier")
patience_earlier
print("RGB Later")
patience_later
import matplotlib.pyplot as plt
plt.subplot(211)
plt.plot(patience_earlier['patience'],patience_earlier['best_loss'],label='RGB Earlier')
plt.plot(patience_later['patience'],patience_later['best_loss'],label='RGB Later')
plt.title("Early Stopping Lowest Loss")
plt.xlabel("Patience")
plt.ylabel("Lowest Loss")
plt.xlim(10,50)
plt.grid()
plt.legend()
plt.subplot(212)
plt.plot(patience_earlier['patience'],patience_earlier['epochs'],label='RGB Earlier')
plt.plot(patience_later['patience'],patience_later['epochs'],label='RGB Later')
plt.title("Early Stopping Total Epochs Trained")
plt.xlabel("Patience")
plt.ylabel("Total Epochs")
plt.xlim(10,50)
plt.grid()
plt.legend()
plt.tight_layout()
plt.show()
The increase in patience seems to have the average effect of linearly increasing the training time. At the same time, however, the losses behave in a convex way, with the minimum somewhere between 150 and 300 total epochs trained. We will use a patience of 30 for both datasets.
The last step is to tune the hyperparameters of the optimizer. Adadelta effectively has two tunable parameters: the initial learning rate lr, which the optimizer changes dynamically, and the coefficient rho of the running average of squared gradients, used to determine how the learning rate changes.
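The roles of lr and rho can be illustrated with a single Adadelta step for one scalar parameter. This is a hypothetical sketch of the update rule, not the actual PyTorch implementation: rho decays the running averages of squared gradients and squared updates, and lr scales the resulting step; all values below are illustrative.

```python
import math

def adadelta_step(param, grad, sq_grad_avg, sq_delta_avg,
                  lr=1.0, rho=0.9, eps=1e-6):
    # Decay the running average of squared gradients.
    sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
    # The step size is the ratio of the RMS of past updates to the RMS of gradients.
    delta = -math.sqrt(sq_delta_avg + eps) / math.sqrt(sq_grad_avg + eps) * grad
    # Decay the running average of squared updates.
    sq_delta_avg = rho * sq_delta_avg + (1 - rho) * delta ** 2
    return param + lr * delta, sq_grad_avg, sq_delta_avg

p, g2, d2 = 1.0, 0.0, 0.0
p, g2, d2 = adadelta_step(p, grad=0.5, sq_grad_avg=g2, sq_delta_avg=d2)
print(p, g2, d2)  # the positive gradient nudges the parameter downward
```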
In the original Adadelta paper, hyperparameter tuning was performed with just 6 epochs on a digit classification task. We will be a bit more generous to our network and use 25 epochs to determine the optimal settings. This means that we won't be utilizing early stopping here, but we will incorporate weight decay.
We will first conduct a coarse grid search, followed by a random search if necessary. We will use the same initialized models as in the regularization testing phase.
import os
import shutil
import numpy as np
import pandas as pd
import seaborn as sns
from torch import optim
from field_analysis.model.dataset import dataperiod as dp
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
import field_analysis.settings.model as model_settings
%matplotlib inline
DB_128 = 'field_analysis_40m_128px.db'
DATASET_NAMES = ['earlier', 'later']
optimized_models_dir = os.path.join(model_settings.MODELS_DIR,'optimization')
os.makedirs(optimized_models_dir,exist_ok=True)
regularized_models_dir = os.path.join(model_settings.MODELS_DIR,'regularization')
for dataset_name in DATASET_NAMES:
    model_name = "initial_model_{}.pkl".format(dataset_name)
    shutil.copyfile(
        os.path.join(regularized_models_dir, model_name),
        os.path.join(optimized_models_dir, model_name))
    assert os.path.isfile(os.path.join(optimized_models_dir, model_name))
def copy_model(cnn, is_later, save):
    "Copy the dataset-wise persisted model either for later use (`save=True`) or current use (`save=False`)."
    cnn.model_path = os.path.join(optimized_models_dir, cnn.model_filename)
    model_folder, _ = os.path.split(cnn.model_path)
    model_name, suffix = cnn.model_filename.split('.')
    model_name = "initial_model_{}.{}".format(
        DATASET_NAMES[is_later], suffix)
    if save:
        cnn.save_model()
        from_path = cnn.model_path
        to_path = os.path.join(model_folder, model_name)
    else:
        from_path = os.path.join(model_folder, model_name)
        to_path = cnn.model_path
    shutil.copyfile(from_path, to_path)
    print("Persisted model copied \n\tFrom: {} \n\tTo: {}".format(from_path, to_path))
def test_optimizer(dataset, weight_decay, lrs, rhos):
    best_losses = pd.DataFrame(
        columns=['lr', 'rho', 'best_loss', 'loss_mean', 'loss_std'])
    for lr in lrs:
        for rho in rhos:
            print("lr={}, rho={}".format(lr, rho))
            cnn = DroneYieldMeanCNN(
                source_bands=3,
                source_dim=128,
                cnn_layers=6,
                optimizer=optim.Adadelta,
                optimizer_parameters={
                    'weight_decay': weight_decay,
                    'lr': lr,
                    'rho': rho})
            # `dataset` is the class itself, so compare identities instead of isinstance().
            copy_model(
                cnn=cnn,
                is_later=dataset is dp.DroneRGBLater,
                save=False)
            cnn.load_model()
            losses_dict = cnn.train(
                epochs=50,
                training_data=dataset(DB_128),
                k_cv_folds=3,
                suppress_output=True)
            losses = np.array(losses_dict['test_losses_mean_std'])[:, 0]
            best_losses = best_losses.append(
                {'lr': lr,
                 'rho': rho,
                 'best_loss': losses.min(),
                 'loss_mean': losses.mean(),
                 'loss_std': losses.std()},
                ignore_index=True)
    return best_losses.sort_values(by='best_loss').reset_index(drop=True)
# (lr, rho, loss)
best_optimizer_earlier = (None, None, None)
best_optimizer_later = (None, None, None)
For the initial values we will use learning rates of 1e-4, 1e-3, 1e-2, 1e-1 and 1e0, and running-average coefficients of 0.0, 0.3, 0.6 and 0.9. This totals 20 trainings per dataset.
optimizer_earlier = test_optimizer(
    dp.DroneRGBEarlier,
    best_l2_earlier[0],
    [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    [0, 0.3, 0.6, 0.9])
pivot = optimizer_earlier.pivot_table(values='best_loss',index='lr',columns='rho')
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGn_r', linewidth=1, linecolor='white')
pivot
best_row = optimizer_earlier.loc[optimizer_earlier['best_loss'].idxmin()]
best_optimizer_earlier = (best_row['lr'], best_row['rho'], best_row['best_loss'])
best_optimizer_earlier
# best_optimizer_earlier = (1.0, 0.9, 396.4310816322885)
optimizer_later = test_optimizer(
    dp.DroneRGBLater,
    best_l2_later[0],
    [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    [0, 0.3, 0.6, 0.9])
pivot = optimizer_later.pivot_table(values='best_loss',index='lr',columns='rho')
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGn_r', linewidth=1, linecolor='white')
pivot
best_row = optimizer_later.loc[optimizer_later['best_loss'].idxmin()]
best_optimizer_later = (best_row['lr'], best_row['rho'], best_row['best_loss'])
best_optimizer_later
# best_optimizer_later = (0.1, 0.3, 353.75241492030295)
From the results, it seems that the optimal learning rate is in the ballpark of 0.01 for the earlier and 0.1 for the later dataset, while the coefficient of the squared-gradient average is around 0.3 for both datasets.
Next up is the random search, with values in the ballpark of those found in the grid search.
optimizer_earlier = test_optimizer(
    dp.DroneRGBEarlier,
    best_l2_earlier[0],
    np.abs(np.random.normal(loc=best_optimizer_earlier[0],
                            scale=2*best_optimizer_earlier[0]*0.1,
                            size=4)),
    np.abs(np.random.normal(loc=best_optimizer_earlier[1],
                            scale=2*best_optimizer_earlier[1]*0.1,
                            size=4)))
pivot = optimizer_earlier.pivot_table(values='best_loss',index='lr',columns='rho')
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGn_r', linewidth=1, linecolor='white')
pivot
best_row = optimizer_earlier.loc[optimizer_earlier['best_loss'].idxmin()]
if best_row['best_loss'] < best_optimizer_earlier[-1]:
best_optimizer_earlier = (best_row['lr'], best_row['rho'], best_row['best_loss'])
best_optimizer_earlier
optimizer_later = test_optimizer(
    dp.DroneRGBLater,
    best_l2_later[0],
    np.abs(np.random.normal(loc=best_optimizer_later[0],
                            scale=2*best_optimizer_later[0]*0.1,
                            size=4)),
    np.abs(np.random.normal(loc=best_optimizer_later[1],
                            scale=2*best_optimizer_later[1]*0.1,
                            size=4)))
pivot = optimizer_later.pivot_table(values='best_loss',index='lr',columns='rho')
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlGn_r', linewidth=1, linecolor='white')
pivot
best_row = optimizer_later.loc[optimizer_later['best_loss'].idxmin()]
if best_row['best_loss'] < best_optimizer_later[-1]:
best_optimizer_later = (best_row['lr'], best_row['rho'], best_row['best_loss'])
best_optimizer_later
#best_optimizer_earlier = (1.0, 0.9, 396.4310816322885)
#best_optimizer_later = (0.1, 0.3, 353.75241492030295)
print("RGB Earlier")
print("\tLearning Rate: {}".format(best_optimizer_earlier[0]))
print("\tMoving Mean Gradient Coefficient: {}".format(best_optimizer_earlier[1]))
print("\tBest Loss: {}".format(best_optimizer_earlier[2]))
print("RGB Later")
print("\tLearning Rate: {}".format(best_optimizer_later[0]))
print("\tMoving Mean Gradient Coefficient: {}".format(best_optimizer_later[1]))
print("\tBest Loss: {}".format(best_optimizer_later[2]))
We then want to see whether tuning the optimizer resulted in a better loss than using only the vanilla default values.
import os
import shutil
import numpy as np
import pandas as pd
import seaborn as sns
from torch import optim
from field_analysis.model.dataset import dataperiod as dp
from field_analysis.model.nets.cnn import DroneYieldMeanCNN
import field_analysis.settings.model as model_settings
%matplotlib inline
DB_128 = 'field_analysis_40m_128px.db'
DATASET_NAMES = ['earlier', 'later']
optimized_models_dir = os.path.join(model_settings.MODELS_DIR,'optimization')
def copy_model(cnn, is_later, save):
    "Copy the dataset-wise persisted model either for later use (`save=True`) or current use (`save=False`)."
    cnn.model_path = os.path.join(optimized_models_dir, cnn.model_filename)
    model_folder, _ = os.path.split(cnn.model_path)
    model_name, suffix = cnn.model_filename.split('.')
    model_name = "initial_model_{}.{}".format(
        DATASET_NAMES[is_later], suffix)
    if save:
        cnn.save_model()
        from_path = cnn.model_path
        to_path = os.path.join(model_folder, model_name)
    else:
        from_path = os.path.join(model_folder, model_name)
        to_path = cnn.model_path
    shutil.copyfile(from_path, to_path)
    print("Persisted model copied \n\tFrom: {} \n\tTo: {}".format(from_path, to_path))
def test_optimizer_full(dataset, weight_decay, patience, lr, rho):
    cnn = DroneYieldMeanCNN(
        source_bands=3,
        source_dim=128,
        cnn_layers=6,
        optimizer=optim.Adadelta,
        optimizer_parameters={
            'weight_decay': weight_decay,
            'lr': lr,
            'rho': rho})
    # `dataset` is the class itself, so compare identities instead of isinstance().
    copy_model(
        cnn=cnn,
        is_later=dataset is dp.DroneRGBLater,
        save=False)
    cnn.load_model()
    cnn.train(
        epochs=250,
        training_data=dataset(DB_128),
        k_cv_folds=3,
        early_stopping_patience=patience)
    # Sequential training after the first early termination.
    cnn.train(
        epochs=250,
        training_data=dataset(DB_128),
        k_cv_folds=3,
        early_stopping_patience=patience)
test_optimizer_full(
    dataset=dp.DroneRGBEarlier,
    weight_decay=best_l2_earlier[0],
    patience=30,
    lr=best_optimizer_earlier[0],
    rho=best_optimizer_earlier[1])
test_optimizer_full(
    dataset=dp.DroneRGBLater,
    weight_decay=best_l2_later[0],
    patience=30,
    lr=best_optimizer_later[0],
    rho=best_optimizer_later[1])